2022-04-25
Missing observations are defined as NA in R.
Missing data can affect data summaries, analyses, and the conclusions drawn from them.
The example data has 25 rows and 5 columns.
head(datm, 15)
##             X1         X2          X3         X4          X5
## 1  -1.31206814         NA -0.09029564 -0.7441435          NA
## 2   0.59018494  0.5283625  0.21045116 -1.1542145  0.13165123
## 3   0.50380010  0.7699291 -1.04938718  0.7556022  0.41923246
## 4           NA -1.0124077 -0.07529569         NA          NA
## 5   0.87236847  0.6567027  0.73149931  1.8083644  1.43617693
## 6           NA -0.9774784 -1.40168157         NA          NA
## 7  -0.35453020 -0.4667603 -0.14309087 -0.2408023 -0.19082025
## 8   1.41439776         NA  1.67916742  1.2121174          NA
## 9  -0.36148186         NA -0.80721413 -1.6152550          NA
## 10  0.01181096 -0.4821782 -0.62479444 -0.4481009 -0.37192083
## 11 -0.24008933 -1.0062626 -0.60164701 -1.1268334 -0.05493203
## 12  0.80804238  0.6071861  1.14141883  1.2401857 -0.19844205
## 13          NA -0.7519924 -0.89651360         NA          NA
## 14 -0.12037291         NA -0.44162562 -0.8241755          NA
## 15 -0.81020471         NA -1.63042369 -2.0644673          NA
Matrix perspective: the number of missing entries in the data matrix.
The is.na function returns TRUE if a cell is missing (NA) and FALSE if a cell is observed.
In the example there are 24 missing data entries. The data frame contains 5 variables for 25 subjects, which makes a total of 125 data entries. So, 19.2% of the data entries are missing.
sum(is.na(datm))
## [1] 24
sum(is.na(datm))/length(is.na(datm))
## [1] 0.192
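The matrix perspective can be tried out on a small example. The sketch below uses a hypothetical data frame `toy` (not the article's `datm`) and relies only on base R: because `TRUE` counts as 1, `sum()` gives the number of missing entries and `mean()` gives their proportion.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

miss <- is.na(toy)        # logical matrix: TRUE where a cell is missing
n_missing <- sum(miss)    # number of missing entries
p_missing <- mean(miss)   # same as sum(miss) / length(miss)

print(n_missing)
print(p_missing)
```

Here `mean(miss)` is simply a shorter way of writing the ratio used above.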
Variables perspective: the number of missing values per variable.
For each variable we can count the number of missing observations (n) and calculate the proportion (p).
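library(dplyr)   # provides %>% and summarise_all()
library(tidyr)   # provides pivot_longer()
datm %>%
  is.na %>%
  data.frame() %>%
  summarise_all(list(n = sum, p = mean)) %>%   # per column: count and proportion of NA
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(.)")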
## # A tibble: 5 x 3
##   variable     n     p
##   <chr>    <int> <dbl>
## 1 X1           4  0.16
## 2 X2           6  0.24
## 3 X3           0  0
## 4 X4           4  0.16
## 5 X5          10  0.4
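The same per-variable summary can also be sketched without the tidyverse. This version assumes a hypothetical data frame `toy` and uses base R's `colSums()` and `colMeans()` on the missingness indicator; it is one possible alternative, not the approach used for the output above.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

miss <- is.na(toy)

# One row per variable: n = number missing, p = proportion missing
per_variable <- data.frame(
  variable = names(toy),
  n = unname(colSums(miss)),
  p = unname(colMeans(miss))
)

print(per_variable)
```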
Case perspective: the number of rows, i.e. cases, with missing values.
Many analysis methods use only the rows that are fully observed; this is called complete-case analysis.
Rows with at least one missing value are then dropped, which is known as listwise deletion.
datm %>%
  is.na %>%
  data.frame() %>%
  mutate(n_miss = rowSums(.),   # number of missing values per row
         missing = ifelse(n_miss > 0, "rows with missing values", "rows without missing values")) %>%
  group_by(missing) %>%
  summarise(n = n(),
            p = n / nrow(datm))
## # A tibble: 2 x 3
##   missing                         n     p
##   <chr>                       <int> <dbl>
## 1 rows with missing values       10   0.4
## 2 rows without missing values    15   0.6
The mice package offers helper functions for the case perspective.
cci: create a logical indicator that is TRUE for each fully observed row.
mice::cci(datm)
##  [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [25]  TRUE
nic: count the number of incomplete cases, i.e. cases with missing values.
mice::nic(datm)
## [1] 10
ncc: count the number of complete cases, i.e. fully observed rows.
mice::ncc(datm)
## [1] 15
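For readers without mice installed, comparable counts can be sketched with base R's `complete.cases()`. The data frame `toy` below is hypothetical, and the mapping to `cci`, `ncc` and `nic` in the comments is an analogy for data frames rather than a statement about how mice implements them.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

complete <- complete.cases(toy)   # TRUE for fully observed rows (compare cci)
n_complete <- sum(complete)       # number of complete cases (compare ncc)
n_incomplete <- sum(!complete)    # number of incomplete cases (compare nic)

# Listwise deletion: keep only the fully observed rows
toy_cc <- toy[complete, ]

print(complete)
print(c(complete = n_complete, incomplete = n_incomplete))
```

Subsetting with the indicator keeps the same rows as `na.omit(toy)`.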
Missing data pattern: the combination of observed and unobserved values that occur together in a row. A pattern is usually written with a 0 for a missing value and a 1 for an observed value.
Data often contains multiple different missing data patterns. The example shows three missing data patterns:
mice::md.pattern(datm, plot = FALSE)
##    X3 X1 X4 X2 X5
## 15  1  1  1  1  1  0
## 6   1  1  1  0  0  2
## 4   1  0  0  1  0  3
##     0  4  4  6 10 24
Row names: the number of times the pattern occurs in the data. Last column: the number of missing values in the pattern. Bottom row: the number of missing values per variable, with the overall total (24) in the last cell.
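To make the idea of a pattern concrete, the following base R sketch codes each row of a hypothetical data frame `toy` as a string of 1s (observed) and 0s (missing) and tabulates the distinct strings. It is a simplified illustration under those assumptions: unlike `md.pattern`, it does not reorder the columns or add the marginal counts.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

# Response indicator: 1 = observed, 0 = missing
r <- ifelse(is.na(toy), 0L, 1L)

# One string per row, e.g. "101" = a observed, b missing, c observed
pattern <- apply(r, 1, paste, collapse = "")

# How often each distinct pattern occurs
pattern_counts <- table(pattern)

print(pattern_counts)
```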
Missing data pair: the number of times two variables are either missing together or observed together.
These counts show how many cases we can actually use for imputation. The md.pairs function from the mice package returns four matrices. Each matrix describes one combination of observed and missing values for every pair of variables.
rr) the count of how often two variables are both observed.
rm) the count of how often the row-variable is observed and the column-variable is missing.
mr) the count of how often the row-variable is missing and the column-variable is observed.
mm) the count of how often two variables are both missing.
pat <- mice::md.pairs(datm)
Observed value counts.
pat$rr
##    X1 X2 X3 X4 X5
## X1 21 15 21 21 15
## X2 15 19 19 15 15
## X3 21 19 25 21 15
## X4 21 15 21 21 15
## X5 15 15 15 15 15
Counts of cases where the row variable is observed and the column variable is missing.
pat$rm
##    X1 X2 X3 X4 X5
## X1  0  6  0  0  6
## X2  4  0  0  4  4
## X3  4  6  0  4 10
## X4  0  6  0  0  6
## X5  0  0  0  0  0
Counts of cases where the row variable is missing and the column variable is observed.
pat$mr
##    X1 X2 X3 X4 X5
## X1  0  4  4  0  0
## X2  6  0  6  6  0
## X3  0  0  0  0  0
## X4  0  4  4  0  0
## X5  6  4 10  6  0
Missing value counts.
pat$mm
##    X1 X2 X3 X4 X5
## X1  4  0  0  4  4
## X2  0  6  0  0  6
## X3  0  0  0  0  0
## X4  4  0  0  4  4
## X5  4  6  0  4 10
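The four pair matrices can be reproduced by hand as cross-products of the response indicator and the missingness indicator. The sketch below does this in base R for a hypothetical data frame `toy`; the list name `pairs_by_hand` is made up for illustration, and the code is meant to mirror the definitions of rr, rm, mr and mm rather than the package's internals.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

r <- ifelse(is.na(toy), 0, 1)   # response indicator: 1 = observed
m <- 1 - r                      # missingness indicator: 1 = missing

# crossprod(x, y) computes t(x) %*% y, i.e. pairwise co-occurrence counts
pairs_by_hand <- list(
  rr = crossprod(r),      # both observed
  rm = crossprod(r, m),   # row observed, column missing
  mr = crossprod(m, r),   # row missing, column observed
  mm = crossprod(m)       # both missing
)

print(pairs_by_hand$rr)
print(pairs_by_hand$mr)
```

For every pair of variables the four counts add up to the number of rows, which is a quick sanity check on the result.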
The proportion of usable cases is mr / (mr + mm): among the cases where the row variable is missing, the share in which the column variable is observed. It shows how many usable cases the data have to impute the row variable from the column variable.
round(100 * pat$mr / (pat$mr + pat$mm))
##     X1  X2  X3  X4  X5
## X1   0 100 100   0   0
## X2 100   0 100 100   0
## X3 NaN NaN NaN NaN NaN
## X4   0 100 100   0   0
## X5  60  40 100  60   0
X3 has no missing values, so its row is NaN (0 divided by 0): there is nothing to impute for X3.
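The NaN row can be seen in a small worked case as well. This base R sketch assumes a hypothetical data frame `toy` in which column `c` is fully observed; it computes the usable-case proportion from hand-built mr and mm matrices and then lists the variables whose rows are NaN.

```r
# Hypothetical toy data (an assumption for illustration, not the article's datm)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(9, 10, 11, 12)
)

m <- ifelse(is.na(toy), 1, 0)   # 1 = missing
r <- 1 - m                      # 1 = observed

mr <- crossprod(m, r)   # row variable missing, column variable observed
mm <- crossprod(m)      # both missing

# Share of cases usable for imputing the row variable from the column variable
usable <- mr / (mr + mm)

# Variables without missing values give 0 / 0 = NaN in their row
nothing_to_impute <- rownames(usable)[apply(is.nan(usable), 1, all)]

print(round(100 * usable))
print(nothing_to_impute)
```

In this toy case half of the rows with `b` missing still have `a` observed, so `a` offers a usable-case proportion of 0.5 for imputing `b`.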